Conversation

@ggerganov
Member

@ggerganov ggerganov commented Oct 23, 2025

ref #4130 (reply in thread)

Current logic in this PR (subject to change; a toy sketch of the eviction policy follows the example below):

  • When using unified KV cache with -kvu, share the entire context -c N among all parallel slots of the server -np N
  • When we run out of space, try to free some by purging old sequences from idle slots, one by one, in no particular order
  • If we still run out of space, terminate all active slots at once
  • The -np N argument still controls the maximum number of parallel jobs, but it no longer changes the per-slot context
  • By default, start the server using 4 slots and unified KV cache

Example:

llama-server -m model.gguf -c 8192 --jinja
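
For illustration, here is a small standalone C++ toy model of the eviction policy described above. The slot and sequence bookkeeping is made up for the example and is not the server's actual code:

#include <cstdio>
#include <vector>

// Toy model: all slots share one context budget; when space runs out, purge
// idle slots one by one, and only if that is still not enough, terminate the
// active slots. Names and structure are illustrative only.
struct Slot {
    int  n_cells = 0;    // KV cells currently held by this slot's sequence
    bool active  = false;
};

static bool try_reserve(std::vector<Slot> & slots, int n_ctx_total, int n_needed) {
    auto used = [&]() {
        int n = 0;
        for (const auto & s : slots) n += s.n_cells;
        return n;
    };

    // 1) purge idle slots one by one until the new tokens fit
    for (auto & s : slots) {
        if (used() + n_needed <= n_ctx_total) break;
        if (!s.active && s.n_cells > 0) {
            std::printf("purging idle slot holding %d cells\n", s.n_cells);
            s.n_cells = 0;
        }
    }

    // 2) if that is still not enough, terminate all active slots at once
    if (used() + n_needed > n_ctx_total) {
        for (auto & s : slots) {
            if (s.active) {
                std::printf("terminating active slot holding %d cells\n", s.n_cells);
                s.n_cells = 0;
                s.active  = false;
            }
        }
    }

    return used() + n_needed <= n_ctx_total;
}

int main() {
    std::vector<Slot> slots(4);
    slots[0] = {6000, true };   // one busy slot
    slots[1] = {1500, false};   // idle slot with a cached prompt

    const int n_ctx_total = 8192;

    std::printf("fits: %s\n", try_reserve(slots, n_ctx_total, 1024) ? "yes" : "no");
    return 0;
}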

TODO:

  • When we run out of space, terminate the active slots one-by-one and keep trying
  • Think about moving the slot into the host-memory cache instead of purging it. Not sure this is really needed, given the existing logic from server : host-memory prompt caching #16391
  • Add tests

Future improvements:

  • When we run out of space, terminate slots one by one instead of all at once
  • Update the logic for starting a new task to check that there is some extra room for generation (not sure if this is needed; the current logic will simply purge one of the other slots, so it should be good as it is)


uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
Member

Should this value be capped when using the unified cache to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or a parameter could be added to allow the user to change it.

Member Author

I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.

We can also cap it for the non-unified case?

Suggested change
-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);

Member

We can also cap it for the non-unified case?

What would happen to the leftover slots? I may be misunderstanding the way split cache works, but my assumption would be that these slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.

Member Author

Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:

llama.cpp/src/llama-model.cpp

Lines 19708 to 19724 in 7863fcc

const auto padding = llama_kv_cache::get_padding(cparams);

uint32_t n_ctx_per_stream = cparams.n_ctx;

if (!cparams.kv_unified) {
    n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
} else {
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream;
}

LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);
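
For illustration (numbers assumed, not taken from the PR): with n_ctx = 8192, n_seq_max = 3 and a padding of 32, the non-unified branch gives n_ctx_per_stream = ceil(8192/3) = 2731, padded up to 2752, so cparams.n_ctx becomes 2752 * 3 = 8256. The unified branch would simply pad 8192 itself, which is already a multiple of 32.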

Since we no longer need the padding logic (as of #16148 and related), we should simplify this.

I'll push a separate PR for this and then will come back to polishing this one.

Member Author

This is now rebased on top of the changes in #16812. The result is that we determine the KV cache size during context creation and there should be no leftover KV cells.

Note that since we now cap the context size to the training context size, user code is recommended to query llama_n_ctx and llama_n_ctx_seq after creating the llama_context in order to obtain the actual context sizes. I'll add comments in llama.h to reflect this.
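
For example, a minimal usage sketch (assuming model has already been loaded, e.g. via llama_model_load_from_file; adjust to the exact declarations in llama.h):

// Sketch: query the effective sizes after creation, since they may have been
// adjusted relative to what was requested.
llama_context_params cparams = llama_context_default_params();
cparams.n_ctx      = 8192;
cparams.n_seq_max  = 4;
cparams.kv_unified = true;

llama_context * ctx = llama_init_from_model(model, cparams);

const uint32_t n_ctx     = llama_n_ctx(ctx);      // total context across sequences
const uint32_t n_ctx_seq = llama_n_ctx_seq(ctx);  // effective per-sequence context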

Will try to clean up this PR next and will open it for review when ready.

@github-actions github-actions bot added the python python script changes label Oct 23, 2025
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch 4 times, most recently from 55bb9db to 6369fe0 Compare October 28, 2025 10:50
@github-actions github-actions bot added the testing Everything test related label Oct 28, 2025
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from 6369fe0 to ac261be Compare October 29, 2025 14:13
Comment on lines +139 to +140
if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: n_ctx_seq (%u) > n_ctx_train (%u) -- possible training context overflow\n",
            __func__, cparams.n_ctx_seq, hparams.n_ctx_train);
Member Author

This branch should not be reached due to the capping above on line 117. But keeping it in case the capping logic gets changed in the future.

@ggerganov ggerganov force-pushed the gg/server-unified-slots branch 2 times, most recently from 0ba88d3 to 4e9e319 Compare October 30, 2025 17:01
@ggerganov ggerganov marked this pull request as ready for review October 30, 2025 18:39
@ggerganov ggerganov requested review from CISC and ngxson as code owners October 30, 2025 18:39
@ggerganov
Member Author

Ready for review. I've marked some TODOs for follow-up PRs since the current implementation is quite basic, yet it gets us 90% of the way to the ideal logic. Will improve the rest of the cases from master.

@ggerganov ggerganov requested a review from slaren October 30, 2025 18:41
Comment on lines 115 to 120
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
}

if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: capping n_ctx_seq (%u) to n_ctx_train (%u)\n", __func__, cparams.n_ctx_seq, hparams.n_ctx_train);

    cparams.n_ctx_seq = hparams.n_ctx_train;
}

if (cparams.kv_unified) {
    cparams.n_ctx = cparams.n_ctx_seq;
} else {
    cparams.n_ctx = cparams.n_ctx_seq * cparams.n_seq_max;
}
Member
@slaren slaren Nov 1, 2025

I am not completely convinced about this; I think it may create confusion and add complexity to applications. The server and other applications using the unified cache need a sequence-length limit independent of n_ctx, but that should probably be a different parameter that defaults to min(n_ctx, n_ctx_train). This would be an application parameter, not part of the llama.cpp API.

Member Author

Sounds good. Just remove the capping here?

Member

Yes, I think it would be preferable to not have a limit here. The user should be able to override the model n_ctx_train, and it is easier to do it this way than with a KV override.

Member Author

Moved the capping logic to the llama-server.
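
For reference, the application-side cap could look roughly like this (a sketch with an assumed variable n_ctx_slot standing in for the server's per-slot context; not the exact server code):

// Sketch: cap the per-slot context to the model's training context on the
// application side, now that libllama no longer enforces the cap itself.
const int32_t n_ctx_train = llama_model_n_ctx_train(model);

if (n_ctx_slot > n_ctx_train) {
    LOG_WRN("slot context (%d) exceeds training context (%d), capping\n", n_ctx_slot, n_ctx_train);
    n_ctx_slot = n_ctx_train;
}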

Comment on lines +4435 to +4448
if (params.n_parallel == 1 && params.kv_unified == false) {
    LOG_WRN("%s: setting n_parallel = 4 and kv_unified = true\n", __func__);

    params.n_parallel = 4;
    params.kv_unified = true;
}
Collaborator

Is there a reason why this can't be the default params in arg.h?

Member Author

I'll see if I can make it the default - I thought that some of the examples might not like it.

Collaborator

Hmm, yeah, I didn't notice that there are multiple examples all using n_parallel.

In this case, maybe we can use a dedicated variable for the server, like params.n_parallel_server?

This can be useful when auto-generating the documentation for the server args.
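
For illustration only, a rough idea of what such a dedicated field could look like (the struct and field names here are hypothetical, not part of the codebase):

// Hypothetical sketch: a server-specific slot count so the other examples can
// keep using n_parallel unchanged.
struct common_params_hypothetical {
    // ... existing fields ...
    int32_t n_parallel        = 1; // used by the other examples
    int32_t n_parallel_server = 4; // used only by llama-server
};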

    n_batch /= 2;
}

SRV_WRN("failed to find free space in the KV cache, retrying with smaller batch size, i = %d, n_batch = %d, ret = %d\n", i, n_batch, ret);
Collaborator

This warning should be moved inside the if condition above, right?

Collaborator

Also, maybe I forgot this from an earlier discussion, but in which case do we currently need to retry with a smaller batch size?

Member Author
@ggerganov ggerganov Nov 1, 2025

The main case for retrying with smaller batches was back when we didn't have ggml_set_rows and we always had to search for a contiguous set of cells (KV slots) inside the cache buffer to place the input batch. Now with ggml_set_rows this is no longer needed and, technically, retrying with a smaller batch size has almost no purpose except in some rare cases.

But generally, when llama_decode returns 1, you should retry with a smaller batch.
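
As an illustration of that pattern, here is a caller-side sketch (the helper name decode_tokens is made up and this is not the server's actual implementation): feed the tokens in chunks of n_batch and halve n_batch whenever llama_decode returns 1.

#include <algorithm>
#include <cstddef>
#include <vector>

#include "llama.h"

static bool decode_tokens(llama_context * ctx, const std::vector<llama_token> & tokens) {
    int32_t n_batch = (int32_t) llama_n_batch(ctx);

    for (size_t i = 0; i < tokens.size(); ) {
        const int32_t n_eval = std::min<int32_t>(n_batch, (int32_t) (tokens.size() - i));

        llama_batch batch = llama_batch_get_one(const_cast<llama_token *>(tokens.data() + i), n_eval);

        const int32_t ret = llama_decode(ctx, batch);
        if (ret == 0) {
            i += n_eval;      // chunk accepted, move to the next one
        } else if (ret == 1 && n_batch > 1) {
            n_batch /= 2;     // no space found in the KV cache: retry this chunk with a smaller batch
        } else {
            return false;     // hard error, or cannot shrink the batch any further
        }
    }

    return true;
}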

@ngxson
Collaborator

ngxson commented Nov 1, 2025

If we still run out of space, terminate all active slots at once

Hmm, this could be a bit of a bummer in terms of UX. For example, consider this case:

  • A user starts a slot and generates text, with the slot using almost all of the context size
  • At the same time (while the first slot is still generating text), the user submits a new request, which starts a second slot
  • Now both slots compete with each other, which eventually causes the second slot to be terminated too early

An idea for improvement could be to only allow starting a task when the remaining context passes a threshold (maybe more than half free?), otherwise defer the task. (Of course, we can implement this in a follow-up PR.)

@ggerganov
Member Author

Yes, there are several edge cases that can be handled better. This specific case can probably be handled even better: when the total context gets filled up, move one active sequence to the host-memory cache and resume it later when another sequence finishes. This way we don't need special thresholds and both sequences would eventually finish (after a short pause for one of them, of course).

@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from 93373cc to c08d0d1 Compare November 1, 2025 15:45
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
Member

Maybe an error could be returned here if n_ctx is not a multiple of n_seq_max, since that's likely to be a mistake.

Member Author
@ggerganov ggerganov Nov 2, 2025

I added a warning. The problem I see with throwing an error is that the user might often want to split the default training context among, for example, 3 sequences. In the majority of cases the training context (typically a power of 2) would not be divisible by 3, which would result in an error.
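
For example (numbers for illustration only): with n_ctx = 8192 and n_seq_max = 3, integer division gives n_ctx_seq = 2730 and only 2 of the requested cells go unused; a hard error here would reject a perfectly reasonable configuration, which is why a warning fits the common case better.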

@ggerganov ggerganov merged commit cd5e3b5 into master Nov 2, 2025
68 of 74 checks passed
@ggerganov ggerganov deleted the gg/server-unified-slots branch November 2, 2025 16:14
@EverchangerL

EverchangerL commented Nov 2, 2025

Hi, by default there is a unified KV cache with 4 slots, but setting only "--parallel 1" still uses 4 slots with the unified KV cache.
However, it uses 1 slot with "--kv-unified --parallel 1", or simply "--kv-unified".

Is it correct that "--parallel 1" doesn't work as expected (it should use 1 slot) without "--kv-unified"?

full command: llama-server.exe --model Prototype-X-12B-Q4_K_S.gguf --host 127.0.0.1 --port 5001 --ctx-size 16384 --gpu-layers 37 --threads 4 --threads-batch 6 --batch-size 256 --cache-type-k q4_0 --cache-type-v q4_0 --no-mmap --mlock --no-webui --cache-ram -1 --parallel 1
